FORECASTING CASE STUDY - RETAILER X (ANISH)

Instructions

The following case study is intended to test how you would structure a solution for demand forecasting.

You have been given a dataset for a retailer X that has stores across the country. Please suggest an approach, or design a mix of approaches, for forecasting sales at the category-store-week level.

Data available:

1).   Macro_Economic_Data - Outlet, YEARWK, Temp, Gas_price_in_dollars, Consumer_price_Index, UnEmployment_rate

2).   Outlets - Outlet, Outlet_Type, Floor_Size

3).   Holiday_Calendar - YEARWK, Is_A_Holiday_Flag

4).   Heirarchy - Dept, Category

5).   Train data - Outlet, Category, YEARWK, Weekly_Sales

6).   Test Data - Store, Category, YEARWK


There may be different types of time series - stable, sporadic, series with only a few weeks of history, etc. Please walk us through your approach to solving the exercise for both short-term forecasting (<= 6 weeks) and long-term forecasting (6+ weeks).

You may use the colab notebook to create charts, do EDA etc to prepare your final forecasting data.

Please do appropriate research to design the solution. More than one type of modeling may need to be used to solve the case study depending on the following:

  1. Forecasting horizon
  2. Available Sales History
  3. Sales patterns (sporadicity, stability)
  4. Availability of additional features
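One way to operationalise points 2-3 is to segment each Outlet-Category series before choosing a model, e.g. with the Syntetos-Boylan ADI/CV² scheme (smooth, erratic, intermittent, lumpy). This is one illustrative option, not something the brief mandates; a minimal sketch:

```python
import numpy as np

def classify_series(weekly_sales, adi_cut=1.32, cv2_cut=0.49):
    """Classify a demand series as smooth / erratic / intermittent / lumpy
    using ADI (average interval between non-zero weeks) and the squared
    coefficient of variation of the non-zero demand sizes.
    Cut-offs follow Syntetos & Boylan (2005)."""
    x = np.asarray(weekly_sales, dtype=float)
    nonzero = x[x > 0]
    if len(nonzero) == 0:
        return "no_demand"
    adi = len(x) / len(nonzero)                      # weeks per demand occurrence
    cv2 = (nonzero.std() / nonzero.mean()) ** 2      # demand-size variability
    if adi <= adi_cut:
        return "smooth" if cv2 <= cv2_cut else "erratic"
    return "intermittent" if cv2 <= cv2_cut else "lumpy"

print(classify_series([100, 110, 95, 105, 102, 98]))        # stable -> smooth
print(classify_series([0, 0, 50, 0, 0, 0, 48, 0, 0, 52]))   # sparse -> intermittent
```

Smooth series suit classical models (ETS/ARIMA); intermittent or lumpy ones usually call for Croston-style or ML approaches instead.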

Points to note

  • Please highlight relevant research/EDA (if any) to support your solution design.

  • While the data is provided to support the modeling exercise end to end, it is not expected that you develop the models. That is optional.

  • The case study will be evaluated on the solution design and suggested approaches.

Expected Output

The output should be the Python code for all the EDA, time series plots, etc. used to arrive at the overall solution design, along with the recommended approach.

The deepnote notebook can be used to do all the EDA you want to do.

In [ ]:
# import pandas as pd
# df = pd.read_csv('Data_Forecasting/Train_data.csv')
# df
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
(traceback through pandas read_csv internals omitted)
FileNotFoundError: [Errno 2] No such file or directory: 'Data_Forecasting/Train_data.csv'





Solution

Author: Anish Mahapatra
Email: anishmahapatra01@gmail.com

Aim

Create a demand forecasting approach or mix of approaches for forecasting sales at a store-category-week level

To do this, we have been given five relevant datasets

  • Macro_Economic_Data - Macroeconomic data such as Gas Price, Temperature, Consumer Price Index and Unemployment Rate at an Outlet and Year-Week level
  • Outlets - Further information on the outlet, such as Floor Size and Outlet Type (A, B, C)
  • Holiday_Calendar - Information on the week(s) that have a holiday
  • Train_Data - Data at a store-category-week level (Weekly_Sales is the target variable)
  • Test_Data - Store-Category-Week combinations for which we need to forecast Weekly_Sales

Now, we need to understand the possible approaches we can take with this data.

Steps

  • Join relevant tables (Train_data, Macro_Economic_Data, Holiday_Calendar & Outlets)

  • Perform data investigation

  • Look for missing data, outliers
  • Visualize the data and understand the time horizon
  • Map out the Seasonality, Trend and noise in the data
  • Perform Exploratory Data Analysis (Distribution plots, univariate analysis)
  • Perform Bivariate analysis (Correlation)
  • Determine whether differencing or log transformations are needed to obtain a stationary time series
  • Produce forecasts for the relevant time horizons
  • Visualize the results
  • Present the approaches in a neat and concise format.
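The stationarity step above can be illustrated on a toy series: a log transform stabilises multiplicative variance, and first differences of the log approximate week-over-week growth rates. (Formal ADF/KPSS tests would confirm stationarity on the real data; the numbers below are made up.)

```python
import numpy as np
import pandas as pd

# Toy weekly sales with a multiplicative trend.
sales = pd.Series([100, 120, 150, 185, 230, 290, 360], dtype=float)

log_sales = np.log(sales)             # stabilises multiplicative variance
log_diff = log_sales.diff().dropna()  # ~ week-over-week growth rate
print(log_diff.round(3).tolist())
```

The raw series grows steadily while the log-differences hover around a constant level, which is exactly the behaviour classical models such as ARIMA expect.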

Entity Relation (ER) Diagram

(Made by Author - Anish via draw.io) ERDig_v1

1 Data understanding



In [ ]:
# Installing the required packages

! pip uninstall -y pandas-profiling &> /dev/null                # Package for pandas profiling - visualization
! pip install pandas-profiling[notebook,html] &> /dev/null      # Uninstalling and reinstalling it due to a bug in Google Colab
! pip install sweetviz &> /dev/null                             # Package for some sweet visualizations
! pip install -U dataprep &> /dev/null                          # Package for instant data preparation
! pip install --upgrade plotly &> /dev/null                     # Updating plotly to the latest version
! pip install jupyter-dash &> /dev/null                         # Installing jupyter-dash 
! pip install pydotplus &> /dev/null                            # Visualization library
! pip install eli5 &> /dev/null
! pip install lime &> /dev/null
! pip3 install catboost &> /dev/null
In [ ]:
# Importing the required packages
import os
import datetime
import warnings                                     # Toggling warnings
import numpy as np                                  # Numerical computing
import pandas as pd                                 # DataFrames
import seaborn as sns                               # Visualization
import matplotlib.pyplot as plt
import pandas_profiling                             # Automatic EDA
import sweetviz as sv                               # Automatic EDA reports
import IPython                                      # Displaying HTML files in the notebook
from sklearn import linear_model
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import GridSearchCV
%matplotlib inline

# Hide warnings
warnings.filterwarnings('ignore')

# Raising the maximum displayed rows and columns to 500
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 500)
In [ ]:
holiday_calendar = pd.read_csv('https://raw.githubusercontent.com/anishmahapatra/DataScienceInterviewPrep/main/Antuit.ai/FORECASTING-CASE-STUDY-RETAILER-X/Data_Forecasting/holiday_calendar.csv')
Macro_Economic_Data = pd.read_csv('https://raw.githubusercontent.com/anishmahapatra/DataScienceInterviewPrep/main/Antuit.ai/FORECASTING-CASE-STUDY-RETAILER-X/Data_Forecasting/Macro_Economic_Data.csv')
Outlets = pd.read_csv('https://raw.githubusercontent.com/anishmahapatra/DataScienceInterviewPrep/main/Antuit.ai/FORECASTING-CASE-STUDY-RETAILER-X/Data_Forecasting/Outlets.csv')
Train_Data = pd.read_csv('https://raw.githubusercontent.com/anishmahapatra/DataScienceInterviewPrep/main/Antuit.ai/FORECASTING-CASE-STUDY-RETAILER-X/Data_Forecasting/Train_data.csv')
Test_data = pd.read_csv('https://raw.githubusercontent.com/anishmahapatra/DataScienceInterviewPrep/main/Antuit.ai/FORECASTING-CASE-STUDY-RETAILER-X/Data_Forecasting/Test_data.csv')
In [ ]:
Train_Data.head()
Out[ ]:
Outlet Category YEARWK Weekly_Sales
0 1 1 201006 24930.50
1 1 1 201007 46045.49
2 1 1 201008 41601.55
3 1 1 201009 19409.54
4 1 1 201010 21833.90
In [ ]:
Macro_Economic_Data.head()
Out[ ]:
Outlet YEARWK Temp Gas_price_in_dollars Consumer_price_Index UnEmployment_rate
0 1.0 201006.0 43.36 2.662 212.096358 7.606
1 1.0 201007.0 39.56 2.638 212.242170 7.606
2 1.0 201008.0 40.98 2.604 212.289143 7.606
3 1.0 201009.0 47.68 2.651 212.319643 7.606
4 1.0 201010.0 47.55 2.715 212.350143 7.606
In [ ]:
trainData_macroEconomicData = pd.merge(Train_Data, Macro_Economic_Data, on=['Outlet', 'YEARWK'], how='left')
trainData_macroEconomicData.head()
Out[ ]:
Outlet Category YEARWK Weekly_Sales Temp Gas_price_in_dollars Consumer_price_Index UnEmployment_rate
0 1 1 201006 24930.50 43.36 2.662 212.096358 7.606
1 1 1 201007 46045.49 39.56 2.638 212.242170 7.606
2 1 1 201008 41601.55 40.98 2.604 212.289143 7.606
3 1 1 201009 19409.54 47.68 2.651 212.319643 7.606
4 1 1 201010 21833.90 47.55 2.715 212.350143 7.606
In [ ]:
trainData_macroEconomicData_holidayData = pd.merge(trainData_macroEconomicData, holiday_calendar, on=['YEARWK'], how='left')
trainData_macroEconomicData_holidayData.head()
Out[ ]:
Outlet Category YEARWK Weekly_Sales Temp Gas_price_in_dollars Consumer_price_Index UnEmployment_rate Is_A_Holiday_Flag
0 1 1 201006 24930.50 43.36 2.662 212.096358 7.606 False
1 1 1 201007 46045.49 39.56 2.638 212.242170 7.606 True
2 1 1 201008 41601.55 40.98 2.604 212.289143 7.606 False
3 1 1 201009 19409.54 47.68 2.651 212.319643 7.606 False
4 1 1 201010 21833.90 47.55 2.715 212.350143 7.606 False
In [ ]:
trainData_macroEconomicData_holidayData_outlet = pd.merge(trainData_macroEconomicData_holidayData, Outlets, on=['Outlet'], how='left')
trainData_macroEconomicData_holidayData_outlet.head()
Out[ ]:
Outlet Category YEARWK Weekly_Sales Temp Gas_price_in_dollars Consumer_price_Index UnEmployment_rate Is_A_Holiday_Flag Outlet_Type Floor_Size
0 1 1 201006 24930.50 43.36 2.662 212.096358 7.606 False A 302630
1 1 1 201007 46045.49 39.56 2.638 212.242170 7.606 True A 302630
2 1 1 201008 41601.55 40.98 2.604 212.289143 7.606 False A 302630
3 1 1 201009 19409.54 47.68 2.651 212.319643 7.606 False A 302630
4 1 1 201010 21833.90 47.55 2.715 212.350143 7.606 False A 302630
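Before layering on further merges, it is worth confirming that each left join neither multiplies rows nor silently leaves keys unmatched. A minimal sanity-check sketch on toy stand-in frames (the column names follow the case data; the values are made up):

```python
import pandas as pd

# Toy stand-ins for Train_Data and Macro_Economic_Data, to illustrate
# checking a left join for row explosion and unmatched keys.
train = pd.DataFrame({'Outlet': [1, 1, 2],
                      'YEARWK': [201006, 201007, 201006],
                      'Weekly_Sales': [100.0, 120.0, 90.0]})
macro = pd.DataFrame({'Outlet': [1, 1],
                      'YEARWK': [201006, 201007],
                      'Temp': [43.4, 39.6]})

merged = pd.merge(train, macro, on=['Outlet', 'YEARWK'], how='left', indicator=True)

# A left join on unique keys should never add rows; unmatched train rows
# surface as 'left_only' and carry NaN macro features.
assert len(merged) == len(train)
print((merged['_merge'] == 'left_only').sum(), "train rows lacked macro data")
```

The `indicator=True` flag (and optionally `validate='m:1'`) makes such problems visible immediately instead of surfacing later as unexplained NaNs.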
Converting YEARWK to Year and Week
In [ ]:
df = trainData_macroEconomicData_holidayData_outlet.copy(deep=True)
In [ ]:
df['YEARWK'] = df['YEARWK'].astype(str)
df['Year'] = df['YEARWK'].str[:4]
df['Week'] = df['YEARWK'].str[4:]
In [ ]:
# Viewing the shape of the data - (rows, columns)
df.shape
Out[ ]:
(382261, 13)
In [ ]:
# Summary of the dataset
print(df.info(verbose=True))
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 382261 entries, 0 to 382260
Data columns (total 15 columns):
 #   Column                Non-Null Count   Dtype         
---  ------                --------------   -----         
 0   df_index              382261 non-null  int64         
 1   Outlet                382261 non-null  int64         
 2   Category              382261 non-null  int64         
 3   YEARWK                382261 non-null  object        
 4   Weekly_Sales          382261 non-null  float64       
 5   Temp                  337992 non-null  float64       
 6   Gas_price_in_dollars  337992 non-null  float64       
 7   Consumer_price_Index  337992 non-null  float64       
 8   UnEmployment_rate     307424 non-null  float64       
 9   Is_A_Holiday_Flag     382261 non-null  bool          
 10  Outlet_Type           382261 non-null  object        
 11  Floor_Size            382261 non-null  int64         
 12  Year                  382261 non-null  object        
 13  Week                  382261 non-null  object        
 14  week_date             382261 non-null  datetime64[ns]
dtypes: bool(1), datetime64[ns](1), float64(5), int64(4), object(4)
memory usage: 41.2+ MB
None

2 Exploratory Data Analysis



In [ ]:
# Printing all the columns with at least one null value
df.columns[df.isna().any()].tolist()
Out[ ]:
['Temp', 'Gas_price_in_dollars', 'Consumer_price_Index', 'UnEmployment_rate']
In [ ]:
print("The number of columns with at least one NULL value is: ", len(df.columns[df.isna().any()].tolist()))
The number of columns with at least one NULL value is:  4
In [ ]:
# Summarizing the percentage of missing values per feature
missing = (df.isnull().sum() / len(df)) * 100
missing_df = pd.DataFrame({'Feature': df.columns,
                           'missing (%)': round(missing, 2)})

missing_df
Out[ ]:
Feature missing (%)
Outlet Outlet 0.00
Category Category 0.00
YEARWK YEARWK 0.00
Weekly_Sales Weekly_Sales 0.00
Temp Temp 11.58
Gas_price_in_dollars Gas_price_in_dollars 11.58
Consumer_price_Index Consumer_price_Index 11.58
UnEmployment_rate UnEmployment_rate 19.58
Is_A_Holiday_Flag Is_A_Holiday_Flag 0.00
Outlet_Type Outlet_Type 0.00
Floor_Size Floor_Size 0.00
Year Year 0.00
Week Week 0.00
In [ ]:
df.isnull().sum()
Out[ ]:
Outlet                      0
Category                    0
YEARWK                      0
Weekly_Sales                0
Temp                    44269
Gas_price_in_dollars    44269
Consumer_price_Index    44269
UnEmployment_rate       74837
Is_A_Holiday_Flag           0
Outlet_Type                 0
Floor_Size                  0
Year                        0
Week                        0
dtype: int64
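Since the macro features move slowly within an outlet, one baseline for the gaps above is to forward- then backward-fill within each outlet's own timeline. This is an assumption on my part (the case prescribes no imputation strategy; model-based imputation is an alternative). A toy sketch:

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the merged data: a macro feature missing in places.
df_toy = pd.DataFrame({
    'Outlet': [1, 1, 1, 2, 2, 2],
    'YEARWK': [201006, 201007, 201008, 201006, 201007, 201008],
    'Temp':   [43.4, np.nan, 41.0, np.nan, 55.2, np.nan],
})

macro_cols = ['Temp']
df_toy = df_toy.sort_values(['Outlet', 'YEARWK'])
# Fill gaps within each outlet's own history only, never across outlets.
df_toy[macro_cols] = df_toy.groupby('Outlet')[macro_cols].ffill()
df_toy[macro_cols] = df_toy.groupby('Outlet')[macro_cols].bfill()
print(df_toy['Temp'].tolist())   # -> [43.4, 43.4, 41.0, 55.2, 55.2, 55.2]
```

Grouping by Outlet before filling matters: a plain `ffill()` on the full frame would leak one outlet's temperature into the next.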
In [ ]:
df = df.reset_index()
# Deriving a calendar date (Monday of the week) for each Year-Week pair
df['week_date'] = (df[['Year', 'Week']].astype(str)
                   .apply(lambda x: datetime.datetime.strptime('-W'.join(x) + '-1', "%Y-W%W-%w"), axis=1))
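Note that `%W` numbers weeks from the first Monday of the year, which may not match the retailer's week convention. If YEARWK follows ISO-8601 week numbering (purely an assumption about the calendar), strptime's `%G`/`%V`/`%u` directives are the safer mapping:

```python
from datetime import datetime

def yearwk_to_date(yearwk):
    """Monday of the ISO-8601 week encoded as YYYYWW.
    Assumes YEARWK uses ISO week numbering; if the retailer uses a fiscal
    calendar instead, a lookup table is the safer route."""
    s = str(yearwk)
    return datetime.strptime(f"{s[:4]}-W{int(s[4:]):02d}-1", "%G-W%V-%u").date()

print(yearwk_to_date(201006))   # -> 2010-02-08
print(yearwk_to_date(201201))   # -> 2012-01-02
```

Whichever convention is chosen, applying it consistently to train, test and holiday calendar is what matters for the joins.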

3 Visualize and Analyze the Data



In [ ]:
from pandas_profiling import ProfileReport          # Pandas Profile to visualize the data
In [ ]:
# Generating the profile report and feeding it into a variable
Profile = ProfileReport(df, title = 'Pandas Profiling Report', html = {'style':{'full_width':True}})
In [ ]:
# Showcasing the Pandas Profiling Report for the retailer dataset
Profile
Out[ ]:

In [ ]:
# We shall use the SweetViz analyze() function to generate a shareable .html file to analyze the dataset
sweetvizReport = sv.analyze(df)
sweetvizReport.show_html('SweetViz.html', open_browser=False)
Report SweetViz.html was generated.
In [ ]:
# Displaying the html file in the Google Colab notebook
IPython.display.HTML('SweetViz.html')
Out[ ]:
SweetViz report summary (key figures from the generated HTML):

  • Dataset: 382,261 rows, 0 duplicates, 15 features (9 numerical, 4 categorical, 2 text), ~126 MB in RAM
  • Weekly_Sales: median ~8k, mean ~16k, range -5k to 693k (negative sales exist); heavily right-skewed (skew 3.15, kurtosis 20.6)
  • Missing values: Temp, Gas_price_in_dollars and Consumer_price_Index are each ~12% missing; UnEmployment_rate is ~20% missing
  • Cardinality: 45 distinct Outlets, 81 distinct Categories, 143 distinct YEARWK values spanning 2010-2012
  • Outlet_Type split: A 53%, B 36%, C 11%; Is_A_Holiday_Flag is True for 7% of rows
  • Strongest associations with Weekly_Sales: Floor_Size (Pearson 0.24), Outlet_Type (correlation ratio 0.18), Category (Pearson 0.15), Week (correlation ratio 0.08)
  • Other notable associations: Floor_Size vs Outlet_Type (correlation ratio 0.81), Week strongly predicts Is_A_Holiday_Flag (uncertainty coefficient 0.75), Temp vs Week (correlation ratio 0.50)
0.01
Outlet
0.01
Floor_Size
0.01
Category
0.00
week_date
MISSING:
---
5,417
1%
-
-
2012-01-02 00:00:00
2,990
<1%
-
-
2012-10-08 00:00:00
2,976
<1%
-
-
2012-10-01 00:00:00
2,967
<1%
-
-
2012-07-30 00:00:00
2,967
<1%
-
-
2012-09-10 00:00:00
2,966
<1%
-
-
2012-09-03 00:00:00
2,962
<1%
-
-
2012-08-27 00:00:00
2,962
<1%
-
-
2012-09-24 00:00:00
2,961
<1%
-
-
2012-07-02 00:00:00
2,961
<1%
-
-
2012-06-18 00:00:00
2,960
<1%
-
-
2012-08-20 00:00:00
2,960
<1%
-
-
2012-06-04 00:00:00
2,959
<1%
-
-
2012-09-17 00:00:00
2,959
<1%
-
-
2012-10-22 00:00:00
2,957
<1%
-
-
2012-08-06 00:00:00
2,956
<1%
-
-
2012-06-11 00:00:00
2,955
<1%
-
-
2012-08-13 00:00:00
2,952
<1%
-
-
2012-07-16 00:00:00
2,952
<1%
-
-
2012-07-09 00:00:00
2,950
<1%
-
-
2012-10-15 00:00:00
2,943
<1%
-
-
2012-05-28 00:00:00
2,941
<1%
-
-
2012-05-21 00:00:00
2,933
<1%
-
-
2012-06-25 00:00:00
2,933
<1%
-
-
2012-07-23 00:00:00
2,744
<1%
-
-
2011-12-26 00:00:00
2,738
<1%
-
-
2011-11-28 00:00:00
2,731
<1%
-
-
2011-12-19 00:00:00
2,728
<1%
-
-
2011-12-12 00:00:00
2,727
<1%
-
-
2012-02-13 00:00:00
2,723
<1%
-
-
2012-02-06 00:00:00
2,717
<1%
-
-
2011-12-05 00:00:00
2,712
<1%
-
-
2012-02-27 00:00:00
2,702
<1%
-
-
2012-04-02 00:00:00
2,699
<1%
-
-
2012-01-30 00:00:00
2,696
<1%
-
-
2012-04-16 00:00:00
2,696
<1%
-
-
2012-03-05 00:00:00
2,696
<1%
-
-
2012-04-09 00:00:00
2,693
<1%
-
-
2012-05-07 00:00:00
2,689
<1%
-
-
2012-01-09 00:00:00
2,687
<1%
-
-
2012-03-12 00:00:00
2,687
<1%
-
-
2011-10-31 00:00:00
2,687
<1%
-
-
2011-11-14 00:00:00
2,686
<1%
-
-
2011-11-21 00:00:00
2,684
<1%
-
-
2012-03-26 00:00:00
2,683
<1%
-
-
2012-02-20 00:00:00
2,683
<1%
-
-
2012-01-16 00:00:00
2,681
<1%
-
-
2012-03-19 00:00:00
2,677
<1%
-
-
2011-11-07 00:00:00
2,676
<1%
-
-
2012-04-30 00:00:00
2,675
<1%
-
-
2012-04-23 00:00:00
2,675
<1%
-
-
2012-01-23 00:00:00
2,674
<1%
-
-
2012-05-14 00:00:00
2,633
<1%
-
-
2010-02-22 00:00:00
2,627
<1%
-
-
2011-05-09 00:00:00
2,620
<1%
-
-
2011-05-16 00:00:00
2,619
<1%
-
-
2011-03-14 00:00:00
2,615
<1%
-
-
2010-12-20 00:00:00
2,614
<1%
-
-
2010-02-08 00:00:00
2,613
<1%
-
-
2010-12-27 00:00:00
2,612
<1%
-
-
2011-04-25 00:00:00
212,323
56%
-
-
(Other)

Data Understanding:

  • Outlet: The outlet/store identifier (outlets are numbered 1 to 45)
  • Category: The category identifier (categories are numbered 1 to 99)
  • YEARWK: The year-week combination (data spans the years 2010 to 2012, weeks 06 to 43)
  • Weekly_Sales: The weekly sales at an Outlet-Category-YEARWK level (range: -4,982.94 to 693,105.36)
In [ ]:
import datetime

df = trainData_macroEconomicData_holidayData_outlet.copy(deep=True)

# Split YEARWK (e.g. 201006) into Year and Week strings
df['YEARWK'] = df['YEARWK'].astype(str)
df['Year'] = df['YEARWK'].str[:4]
df['Week'] = df['YEARWK'].str[4:]

df = df.reset_index()

# Derive the Monday of each year-week as a proper datetime column
df['week_date'] = (df[['Year', 'Week']].astype(str)
                   .apply(lambda x: datetime.datetime.strptime('-W'.join(x) + '-1', "%Y-W%W-%w"), axis=1))
In [ ]:
df_mod = df.copy(deep=True)
df.head()
Out[ ]:
index Outlet Category YEARWK Weekly_Sales Temp Gas_price_in_dollars Consumer_price_Index UnEmployment_rate Is_A_Holiday_Flag Outlet_Type Floor_Size Year Week week_date
0 0 1 1 201006 24930.50 43.36 2.662 212.096358 7.606 False A 302630 2010 06 2010-02-08
1 1 1 1 201007 46045.49 39.56 2.638 212.242170 7.606 True A 302630 2010 07 2010-02-15
2 2 1 1 201008 41601.55 40.98 2.604 212.289143 7.606 False A 302630 2010 08 2010-02-22
3 3 1 1 201009 19409.54 47.68 2.651 212.319643 7.606 False A 302630 2010 09 2010-03-01
4 4 1 1 201010 21833.90 47.55 2.715 212.350143 7.606 False A 302630 2010 10 2010-03-08
In [ ]:
df_year = df_mod.set_index(['Year'])
df_week_date = df_mod.set_index(['week_date'])

Analyzing weekly sales aggregated at a year level (mean, sum)

In [ ]:
plt.plot(df_year.groupby(df_year.index)["Weekly_Sales"].mean())
Out[ ]:
[<matplotlib.lines.Line2D at 0x7fdd86355a50>]
In [ ]:
plt.plot(df_year.groupby(df_year.index)["Weekly_Sales"].sum())
Out[ ]:
[<matplotlib.lines.Line2D at 0x7fdd8577ed50>]

Analyzing weekly sales aggregated at a week level (mean, sum)

In [ ]:
plt.plot(df_week_date.groupby(df_week_date.index)["Weekly_Sales"].mean())
Out[ ]:
[<matplotlib.lines.Line2D at 0x7fdd8675e990>]
In [ ]:
plt.plot(df_week_date.groupby(df_week_date.index)["Weekly_Sales"].sum())
Out[ ]:
[<matplotlib.lines.Line2D at 0x7fdd8588e310>]

Now that we have done the preliminary Exploratory Data Analysis, let's list down some of our observations:

  1. Weekly sales are higher in October, November and December (highest in December), likely owing to the holiday season
  2. Outlet type and floor size are strongly associated (correlation ratio 0.81), suggesting that floor size is standardized by outlet type
  3. One would expect larger floor sizes to drive higher weekly sales, but the correlation is weak (Pearson 0.24)
  4. The holiday flag has a correlation ratio of only 0.01 with weekly sales, so the flag alone is not a reliable indicator of higher sales (this may be an artifact of this synthetic dataset; in the real world it would be a big red flag worth investigating)
  5. From 2010 to 2012, mean weekly sales show a declining year-over-year trend

4. Case Study Analysis

Back to Table of Contents


Based on the problem statement of the case study, marks are awarded for the approach rather than the modelling.

The factors to consider are:

  • Forecasting horizon
  • Available Sales History
  • Sales patterns (sporadicity, stability)
  • Availability of additional features

Below is the methodology we should follow:

1. Form a baseline model

Data Wrangling
  • Take the Train_Data and perform preliminary analysis
  • Check for missing values and outliers
  • Split the YEARWK column into "Year" and "Week" and make a new column "week_date" of type datetime using the pandas package
  • Make the new columns "week_date" as the index
  • Account for missing values and fill them in accordingly (e.g. the mode for categorical variables)
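The wrangling steps above can be sketched as a single helper. This is an illustrative assumption, not the final pipeline: the function name and the mode-based fill rule are choices made here for the sketch, and the year-week parsing mirrors the notebook's `strptime` approach.

```python
import datetime

import pandas as pd


def wrangle(df: pd.DataFrame) -> pd.DataFrame:
    """Sketch of the data-wrangling step (fill rules are assumptions)."""
    df = df.copy()
    yearwk = df["YEARWK"].astype(str)
    df["Year"] = yearwk.str[:4]
    df["Week"] = yearwk.str[4:]
    # Monday of each year-week, e.g. 201006 -> 2010-02-08
    df["week_date"] = yearwk.map(
        lambda s: datetime.datetime.strptime(s + "-1", "%Y%W-%w")
    )
    # Fill categorical gaps with the mode (illustrative choice)
    for col in df.select_dtypes(include="object"):
        df[col] = df[col].fillna(df[col].mode().iloc[0])
    return df.set_index("week_date")
```

Setting `week_date` as the index up front makes the later groupby/resample plots one-liners.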
Feature Engineering
  • For the categorical variables, namely Outlet and Category perform one-hot encoding and remove one column to avoid the Dummy Variable Trap
  • For the numerical variables, if convergence or feature scale matters for the chosen model, consider normalizing or standardizing the data
  • Visualize the data, including seasonality, trend and noise
  • Consider lag variables in case they may help increase the accuracy of the models
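A minimal sketch of the encoding and lag steps above. The lag set here is an assumption, and in a real pipeline the lags would be computed per Outlet-Category series (grouped shift) rather than on the flat frame:

```python
import pandas as pd


def add_features(df: pd.DataFrame, lags=(1, 2, 4)) -> pd.DataFrame:
    """Sketch: one-hot encoding plus lagged-sales features (lags are assumptions)."""
    df = df.sort_values("week_date").copy()
    # One-hot encode, dropping the first level to avoid the dummy-variable trap
    df = pd.get_dummies(df, columns=["Outlet", "Category"], drop_first=True)
    # Lagged sales as autoregressive features
    for k in lags:
        df[f"sales_lag_{k}"] = df["Weekly_Sales"].shift(k)
    return df
```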
Timeseries Analysis
  • Now, check whether the time series is stationary; this can be done with the ADF test (the adfuller function in the statsmodels package)
  • If the series is non-stationary, apply differencing (one or more orders) or exponential smoothing, as required
  • We can also apply transformations such as the log transformation to make skewed data closer to normal; it dampens exponential variance and brings values into a more workable range
  • The dataset is then ready to be modelled; we can use ACF and PACF plots to choose the p, d and q values (if not, we can start off with Auto-ARIMA)
  • We can apply models such as ARIMA, SARIMA, SARIMAX
  2. Add new datasets and re-model

Entity Relation (ER) Diagram

(Made by Author - Anish via draw.io) ERDig_v1

Add additional datasets
  • Add the additional datasets of Macro_Economic_Data, Holiday_Calendar, Outlet and Hierarchy (as required) and perform the steps under #1 again.
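The joins follow the ER diagram's keys. A sketch of assembling the full modelling table (the function and variable names are illustrative; the join keys come from the case-study schema):

```python
import pandas as pd


def assemble(train, macro, holidays, outlets, hierarchy):
    """Left-join the auxiliary tables onto the training data."""
    return (train
            .merge(macro, on=["Outlet", "YEARWK"], how="left")      # macro-economic features
            .merge(holidays, on="YEARWK", how="left")               # holiday flag
            .merge(outlets, on="Outlet", how="left")                # outlet type, floor size
            .merge(hierarchy, on="Category", how="left"))           # department roll-up
```

Left joins keep the train-data grain (Outlet-Category-YEARWK) intact, so no sales rows are dropped or duplicated when an auxiliary table has gaps.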

Here, it is critical to note that the stakeholders expect the forecasts at a category-store-week level.

We will compare this with the baseline model that we have and understand if we are getting better results and why.

  3. Generate insights

Based on this, we will decide the following:

  • Forecasting horizon
  • Available Sales History
  • Sales patterns (sporadicity, stability)
  • Availability of additional features
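The sales-pattern dimension above can be operationalized by segmenting each Outlet-Category series, for example with a Syntetos-Boylan style classification. A sketch: the cut-offs (ADI 1.32, CV² 0.49) are the standard ones from that literature, and treating non-positive weeks as "no demand" is an assumption.

```python
import numpy as np


def classify_pattern(sales):
    """Label a series as smooth / erratic / intermittent / lumpy."""
    s = np.asarray(sales, dtype=float)
    demand = s[s > 0]                         # weeks with actual demand
    if len(demand) < 2:
        return "insufficient history"
    adi = len(s) / len(demand)                # average inter-demand interval
    cv2 = (demand.std() / demand.mean()) ** 2  # squared coeff. of variation
    if adi < 1.32:
        return "smooth" if cv2 < 0.49 else "erratic"
    return "intermittent" if cv2 < 0.49 else "lumpy"
```

Smooth series can go to classical models (SARIMA, exponential smoothing), while intermittent or lumpy series are better served by methods such as Croston's, and series with insufficient history fall back to category- or outlet-level aggregates.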

One thing to note here is that with more sales history we could forecast further into the future, and more accurately.

We have commented on the sales patterns in our observations and derived additional features, such as lags, in our recommendations.